Isolation forests use tree-based decision-making to separate observations rather than group them: each tree makes random splits on randomly chosen features until every observation sits alone in its own partition, and observations that get isolated in only a few splits are flagged as anomalies. You might visualize this in tree form:

If you prefer to think about the points in two-dimensional space, you can also use something like this:

Here you can see that a highly anomalous observation is easily separated from the bulk of the sample, while a non-anomalous one requires many more steps to isolate.
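As a minimal sketch of that idea (using scikit-learn's `IsolationForest` on a made-up 2D sample, not the Spotify data we'll use below):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# A tight cluster of "normal" points plus one far-away point.
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               [[8.0, 8.0]]])

clf = IsolationForest(random_state=0).fit(X)
preds = clf.predict(X)  # -1 = anomaly, 1 = normal
```

The distant point at (8, 8) takes very few splits to isolate, so it comes back labeled -1.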
Today we are going to implement this modeling approach on a sample of Spotify data describing song characteristics. We'll be using these libraries:
From Kaggle - tracks on Spotify
https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv
Let's identify really unusual tracks!
dataset = pd.read_csv("/Users/skirmer/projects/isolationforests/data/data.csv")
Characteristics of the tracks:
dataset.columns
Index(['acousticness', 'artists', 'danceability', 'duration_ms', 'energy',
'explicit', 'id', 'instrumentalness', 'key', 'liveness', 'loudness',
'mode', 'name', 'popularity', 'release_date', 'speechiness', 'tempo',
'valence', 'year'],
dtype='object')
Time to look at examples! Let's see some highly instrumental tracks.
dataset.query("instrumentalness > .94")[['artists', 'name', 'year']].sample(frac=0.001, replace=False, random_state=3)
| | artists | name | year |
|---|---|---|---|
| 127348 | ['M. K. Thyagaraja Bhagavathar'] | Eesan Kaanmino | 1939 |
| 138009 | ['Johann Sebastian Bach', 'Glenn Gould'] | Partita No. 3 in A Minor, BWV 827: II. Allemande | 1957 |
| 101081 | ['Miles Davis'] | Évasion de Julien - BOF "Ascenseur pour l'écha... | 1958 |
What about very "speechy" ones?
dataset.query("speechiness > .90")[['artists', 'name', 'year']].sample(frac=0.001, replace=False, random_state=42)
| | artists | name | year |
|---|---|---|---|
| 89377 | ['Ernst H. Gombrich', 'Christoph Waltz'] | Eine kurze Weltgeschichte für junge Leser: Von... | 1936 |
| 36198 | ['Rodney Dangerfield'] | No Respect | 1980 |
| 17345 | ['Kurt Heynicke'] | Kapitel 21 - Oktoberfrühling | 1952 |
Finally, let's poke at the loud ones. (Spotify's loudness feature is measured in decibels and is usually negative, so anything above 0.85 is extremely loud.)
dataset.query("loudness > .85")[['artists', 'name', 'year']].sample(frac=0.2, replace=False, random_state=999)
| | artists | name | year |
|---|---|---|---|
| 93131 | ['The Stooges'] | Search and Destroy - Iggy Pop Mix | 1973 |
| 54020 | ['Sergio Vega "El Shaka"'] | Musico, Poeta y Loco | 2004 |
Yeah, that tracks! These samples match intuition, which gives us some confidence that the features mean what they claim.
The preprocessing is going to be very minimal, because we want to get right to the model. One thing I'm doing is binning the years and cutting off songs before 1960, just because the data is a little different before that time. We want to find songs that are truly unusual, not just artifacts of weird data collection.
dataset = dataset.query('year > 1960')
dataset['year_bin'] = pd.cut(dataset['year'], 9)
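For intuition, `pd.cut` with an integer bin count splits the range into equal-width intervals, nudging the lowest edge slightly below the minimum. A toy version:

```python
import pandas as pd

# Cut a small range of years into 3 equal-width bins.
years = pd.Series([1961, 1975, 1990, 2005, 2020])
year_bin = pd.cut(years, 3)
```

This is why bins like (1960.941, 1967.556] show up later: the left edge of the first bin is extended just below 1961.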
That leaves us with the following feature set, for all songs released after 1960. Also, I'm downsampling for the graphs (only for those, not for the modeling) to allow us to visualize things more easily.
features = ['acousticness', 'explicit', 'duration_ms','instrumentalness', 'key',
'liveness', 'loudness', 'mode', 'popularity','valence',
'speechiness', 'tempo', 'energy', 'danceability']
I like to put my model training steps in a function; it keeps things tidy.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
def fit_model(dataframe, feature_list):
    '''Accepts a dataframe and a list of features to be used in model training.
    Returns the trained model object and the dataset with NAs removed and
    anomalousness features appended.'''
    # Handle missingness
    nd = dataframe.dropna()
    # nd = dataframe.fillna(value=0)

    # Train the model
    X_train2 = nd.loc[:, feature_list].copy()  # copy to avoid SettingWithCopyWarning below
    clf = IsolationForest(n_estimators=300,
                          # max_features=5,
                          max_samples=300,
                          contamination=.02,
                          bootstrap=True,
                          verbose=0,
                          random_state=rng)  # rng: a numpy RandomState defined earlier
    clf.fit(X_train2)
    decfn = clf.decision_function(X_train2)
    scores = clf.score_samples(X_train2)

    # Predict, then append predictions and some other features.
    y_pred_train = clf.predict(X_train2)
    X_train2.loc[:, 'preds'] = y_pred_train
    X_train2.loc[:, 'preds_txt'] = X_train2.loc[:, 'preds'].replace({-1: "Anomaly", 1: "Normal"})
    X_train2.loc[:, 'decision_fn'] = decfn
    X_train2.loc[:, 'scores'] = scores
    X_train2.loc[:, 'year_bin'] = nd.loc[:, 'year_bin']

    # Return the model and the augmented data.
    return clf, X_train2
def predict_on_new(newdata, feature_list, modelobj):
    '''Accepts a dataframe, a list of features required by the model, and a
    trained model object. Returns the dataset with NAs removed and
    anomalousness features appended.'''
    # Handle missingness
    nd = newdata.dropna()
    # nd = newdata.fillna(value=0)

    # Predict
    dataframe = nd.loc[:, feature_list].copy()  # copy to avoid SettingWithCopyWarning below
    decfn = modelobj.decision_function(dataframe)
    scores = modelobj.score_samples(dataframe)
    y_pred_test = modelobj.predict(dataframe)

    # Augment the dataframe
    dataframe.loc[:, 'preds'] = y_pred_test
    dataframe.loc[:, 'preds_txt'] = dataframe.loc[:, 'preds'].replace({-1: "Anomaly", 1: "Normal"})
    dataframe.loc[:, 'decision_fn'] = decfn
    dataframe.loc[:, 'scores'] = scores
    dataframe.loc[:, 'year_bin'] = nd.loc[:, 'year_bin']

    # Return the augmented data
    return dataframe
X_train, X_test = train_test_split(dataset, test_size=0.3)
modobj, new_X_train= fit_model(X_train, features)
new_X_train[['preds','preds_txt','decision_fn','scores']].head()
| | preds | preds_txt | decision_fn | scores |
|---|---|---|---|---|
| 11531 | 1 | Normal | 0.081888 | -0.499552 |
| 151682 | 1 | Normal | 0.067252 | -0.514188 |
| 147888 | 1 | Normal | 0.118299 | -0.463142 |
| 111807 | 1 | Normal | 0.126708 | -0.454732 |
| 63004 | 1 | Normal | 0.115847 | -0.465593 |
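A side note on those two score columns, based on my reading of the scikit-learn API: `decision_function` is just `score_samples` shifted by the fitted `offset_`, which is why the two columns differ by a constant. A quick self-contained check on made-up data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).rand(100, 3)
clf = IsolationForest(contamination=0.02, random_state=0).fit(X)

# The gap between the two scores is the same constant for every sample.
shift = clf.decision_function(X) - clf.score_samples(X)
```

Negative `decision_function` values are the ones `predict` labels -1.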
new_X_test = predict_on_new(newdata=X_test, feature_list=features, modelobj=modobj)
new_X_test[['preds','preds_txt','decision_fn','scores']].head()
| | preds | preds_txt | decision_fn | scores |
|---|---|---|---|---|
| 55067 | 1 | Normal | 0.138246 | -0.443194 |
| 104078 | 1 | Normal | 0.117361 | -0.464080 |
| 96192 | 1 | Normal | 0.114404 | -0.467037 |
| 37647 | 1 | Normal | 0.132059 | -0.449381 |
| 42492 | 1 | Normal | 0.148011 | -0.433429 |
I'm going to use Shapley values here to evaluate which features are most helpful in isolating our observations. Note that the summary plot will show exactly as many features as we selected for model training.
X_explain = new_X_test.drop(['preds', 'preds_txt', 'year_bin', 'scores', 'decision_fn'], axis=1)
shap_values = shap.TreeExplainer(modobj).shap_values(X_explain)
shap.summary_plot(shap_values, X_explain)
new_X_test['preds'].value_counts(normalize=True) * 100
 1    98.075048
-1     1.924952
Name: preds, dtype: float64
new_X_train['preds'].value_counts(normalize=True) * 100
 1    97.999021
-1     2.000979
Name: preds, dtype: float64
Both proportions sit right around the 2% contamination rate we specified, as expected.
output_notebook()
# create a new plot with a title and axis labels
p = figure(title="Training Sample Score", y_axis_label='speechiness Score', x_axis_label='Tempo, bpm', width=700, height = 300)
source = ColumnDataSource(data=dict(x=new_X_train['tempo'],
                                    y=new_X_train['speechiness'],
                                    ur=new_X_train['preds_txt'],
                                    legend_group=new_X_train['preds_txt']))
colors = factor_cmap('ur', palette=Spectral6, factors=new_X_train.preds_txt.unique())
p.circle(color=colors, legend_group='legend_group', source=source, fill_alpha=.75, line_alpha=.5)
show(p)
# create a new plot with a title and axis labels
p = figure(title="Test Sample Score", y_axis_label='liveness Score', x_axis_label='energy', width=700, height = 300)
source = ColumnDataSource(data=dict(x=new_X_test['energy'],
                                    y=new_X_test['liveness'],
                                    ur=new_X_test['preds_txt'],
                                    legend_group=new_X_test['preds_txt']))
colors = factor_cmap('ur', palette=Spectral6, factors=new_X_test.preds_txt.unique())
p.circle(color=colors, legend_group='legend_group', source=source)
show(p)
We can look into the actual trees in our forest to see some of the decision choices that the model made.
This tree starts with duration: short songs go to the left, long ones to the right. I'm using graphviz to produce this lovely diagram.
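The post doesn't show the export step itself, but one way a diagram like this can be produced is with scikit-learn's `export_graphviz` applied to a single tree from the fitted forest. A self-contained sketch (the tiny stand-in model and feature names below are illustrative; in the post you'd pass `modobj.estimators_[0]` and the real `features` list):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import export_graphviz

# Stand-in model so this snippet runs on its own.
X = np.random.RandomState(0).rand(50, 2)
model = IsolationForest(n_estimators=10, random_state=0).fit(X)

# Export the first tree of the forest as graphviz DOT text; render it
# in a notebook with graphviz.Source(dot).
dot = export_graphviz(model.estimators_[0],
                      feature_names=['duration_ms', 'tempo'],
                      max_depth=2)
```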
graph
Another way to explore our results is to apply PCA to the features, and see where the items marked outlier by our model appear. If the model has done well, we should see our outliers on the margins and away from the center of the PCA plot.
I am randomly downsampling to allow us to visualize the data more easily.
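The construction of `X_reduce2b` isn't shown; a plausible reconstruction (my assumption: project the feature columns down to three principal components and attach a color column keyed off the predictions) would look like this, with stand-in data so it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for the (downsampled) model features; in the post this would
# come from new_X_train, with `preds` holding the -1/1 labels.
rng = np.random.RandomState(0)
demo = pd.DataFrame(rng.rand(200, 5))
preds = rng.choice([-1, 1], size=200, p=[0.05, 0.95])

# Three components, stored under integer column names 0, 1, 2 to match
# the X_reduce2b[0] / [1] / [2] indexing in the plotting code.
X_reduce2b = pd.DataFrame(PCA(n_components=3).fit_transform(demo))
X_reduce2b['outlier_color'] = np.where(preds == -1, 1.0, 0.0)
```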
x = X_reduce2b[2]
y = X_reduce2b[0]
z = X_reduce2b[1]
col = X_reduce2b['outlier_color']

fig = go.Figure(data=[go.Scatter3d(
    x=x, y=y, z=z, mode='markers',
    marker=dict(size=2, color=np.array(col),
                colorscale='Viridis', opacity=0.8))])
fig.update_layout(
    autosize=False,
    width=600, height=500,
    margin=dict(l=5, r=5, b=5, t=5, pad=1))
fig.show()
I want to check a few other things before calling the job done. Are our anomalies pretty evenly distributed across time periods?
(ggplot(new_X_train, aes(x='year_bin', fill='factor(preds_txt)'))
 + theme_bw()
 + theme(axis_text_x=element_text(rotation=45, hjust=1))
 + geom_bar()
 + labs(title='Scores by Year'))
pd.crosstab(new_X_test['preds_txt'], new_X_test['year_bin'])
| preds_txt | (1960.941, 1967.556] | (1967.556, 1974.111] | (1974.111, 1980.667] | (1980.667, 1987.222] | (1987.222, 1993.778] | (1993.778, 2000.333] | (2000.333, 2006.889] | (2006.889, 2013.444] | (2013.444, 2020.0] |
|---|---|---|---|---|---|---|---|---|---|
| Anomaly | 119 | 35 | 30 | 87 | 61 | 73 | 41 | 74 | 171 |
| Normal | 4101 | 4127 | 3579 | 4165 | 3480 | 4100 | 3573 | 4139 | 3942 |
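The raw counts above make the comparison a little hard; normalizing the crosstab by row or column turns it into per-bin anomaly rates, which answers the "evenly distributed" question directly. A toy sketch of that (stand-in data; in the post you'd pass the real `new_X_test` columns):

```python
import pandas as pd

df = pd.DataFrame({
    'year_bin':  ['60s', '60s', '60s', '60s', '70s', '70s', '70s', '70s'],
    'preds_txt': ['Anomaly', 'Normal', 'Normal', 'Normal',
                  'Anomaly', 'Anomaly', 'Normal', 'Normal'],
})
# normalize='index' gives the share of anomalies within each year bin.
rates = pd.crosstab(df['year_bin'], df['preds_txt'], normalize='index')
```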
What are the score values looking like, out of curiosity?
(ggplot(new_X_test, aes(x='scores', y='factor(preds_txt)'))
 + theme_bw()
 + geom_bin2d()
 + labs(title='Density of Scores'))
Now what I'd like to do is actually grab some titles and artists and just see what the model has decided is anomalous. Our data is pretty intuitive and the model is not too complex, so we ought to be able to infer what's up.
Remember that the Shapley values said that these were some of our most influential traits:
But we can look at all kinds of features from the trees. We'll start with a random sample.
anoms = new_X_test.query("preds_txt == 'Anomaly'")
tc = anoms.sample(frac=0.01, replace=False, random_state=1).index
dataset.loc[tc][['artists', 'name', 'year', 'speechiness', 'duration_ms',
'liveness', 'acousticness', 'instrumentalness', 'explicit',
'energy', 'tempo', 'valence']]
| | artists | name | year | speechiness | duration_ms | liveness | acousticness | instrumentalness | explicit | energy | tempo | valence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 59667 | ['Riley Lee'] | Return Of The Fisherman | 1984 | 0.0439 | 550360 | 0.0910 | 0.993 | 0.5760 | 0 | 0.0640 | 66.488 | 0.17400 |
| 138403 | ['Stan Kenton'] | Say It Isn't So | 1961 | 0.0340 | 156360 | 0.1580 | 0.971 | 0.9700 | 0 | 0.0392 | 77.986 | 0.03830 |
| 76902 | ['Liquid Mind'] | Whisper to Me | 2009 | 0.0503 | 484013 | 0.0881 | 0.962 | 0.9520 | 0 | 0.1100 | 59.117 | 0.03320 |
| 166675 | ['Public Enemy'] | Countdown To Armageddon | 1988 | 0.0623 | 100533 | 0.9720 | 0.977 | 0.0658 | 1 | 0.5090 | 105.207 | 0.18800 |
| 57517 | ['Bill Cosby'] | The Pep Talk | 1963 | 0.9200 | 105387 | 0.7370 | 0.911 | 0.0000 | 0 | 0.6390 | 105.376 | 0.71000 |
| 143947 | ['The Sleep Helpers'] | Never Ending Rain | 2017 | 0.0603 | 269328 | 0.9650 | 0.881 | 0.5790 | 0 | 0.9980 | 106.706 | 0.00001 |
| 72433 | ['Richard Wagner', 'Wiener Philharmoniker', 'R... | Wagner: Lohengrin, WWV 75, Act 1: Prelude (Lan... | 1964 | 0.0381 | 513600 | 0.1400 | 0.918 | 0.9550 | 0 | 0.0365 | 67.927 | 0.03150 |
Returning to the tree we looked at from our forest, let's think about what the tree paths are producing.
c1 = 'duration_ms <= 272654.606'
c2 = 'tempo <= 125.766'
c3 = 'danceability <= 0.204'
c4 = 'instrumentalness <= 0.1'
tc = anoms.index
full_anoms = dataset.loc[tc]
full_anoms.query(f"{c1} & {c2} & {c3} & {c4}")[['artists', 'name', 'year', 'duration_ms', 'instrumentalness', 'tempo', 'danceability']].head()
| | artists | name | year | duration_ms | instrumentalness | tempo | danceability |
|---|---|---|---|---|---|---|---|
| 125827 | ['Fabricantes De Lluvia'] | Cozy Rain Sound | 2017 | 169331 | 0.000000 | 52.764 | 0.0574 |
| 126004 | ['Crafting Audio'] | Cabin Noise Focus | 2018 | 162407 | 0.013800 | 80.628 | 0.1070 |
| 53453 | ['System Of A Down'] | Suite-Pee - Live at Irving Plaza, NYC, NY - Ja... | 1998 | 178867 | 0.006000 | 67.934 | 0.1970 |
| 93970 | ['Bad Brains'] | Sailin' On | 1982 | 115040 | 0.000047 | 82.083 | 0.1390 |
| 14173 | ['Oasis'] | Fuckin' In The Bushes | 2000 | 199360 | 0.045100 | 100.609 | 0.1470 |
What about another branch? What do we find if we follow that road?
c1 = 'duration_ms > 272654.606'
c2 = 'speechiness > 0.196'
c3 = 'loudness > -3.478'
c4 = 'explicit > 0.606'
full_anoms.query(f"{c1} & {c2} & {c3} & {c4}")[['artists', 'name', 'year', 'explicit','duration_ms', 'speechiness', 'instrumentalness', 'valence', 'loudness']]
| | artists | name | year | explicit | duration_ms | speechiness | instrumentalness | valence | loudness |
|---|---|---|---|---|---|---|---|---|---|
| 24055 | ['Lil Darkie'] | GENOCIDE | 2019 | 1 | 632625 | 0.486 | 0.000000 | 0.604 | 0.457 |
| 36996 | ['Eazy-E', 'MC Ren', 'Dr. Dre'] | We Want Eazy - Remix | 1988 | 1 | 398160 | 0.206 | 0.000000 | 0.632 | -3.067 |
| 114852 | ['Tech N9ne'] | Imma Tell | 2002 | 1 | 285800 | 0.260 | 0.000000 | 0.565 | -3.084 |
| 97597 | ['Nio Garcia', 'Casper Magico', 'Bad Bunny', '... | Te Boté - Remix | 2018 | 1 | 417920 | 0.214 | 0.000013 | 0.442 | -3.445 |
| 125208 | ['Eminem', 'Kobe Honeycutt'] | Talkin’ 2 Myself | 2010 | 1 | 300427 | 0.415 | 0.000000 | 0.494 | -2.160 |
| 115695 | ['Eminem'] | Cold Wind Blows | 2010 | 1 | 303733 | 0.416 | 0.000000 | 0.584 | -0.517 |
| 86853 | ['JAY-Z', 'Alicia Keys'] | Empire State Of Mind | 2009 | 1 | 276920 | 0.392 | 0.000000 | 0.811 | -1.538 |
To recap, here's what we did:
I hope that you can use this in your own projects in the future! Thank you!
References:
- https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_violin.html#plotnine.geoms.geom_violin
- https://plotnine.readthedocs.io/en/stable/api.html#geoms
- https://scikit-learn.org/stable/modules/tree.html#tree
- https://www.graphviz.org/download/
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
- https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
- https://plotly.com/python/line-and-scatter/
- https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv
- https://rise.readthedocs.io/en/stable/usage.html
- https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2